25 research outputs found

    Array Configuration-Agnostic Personal Voice Activity Detection Based on Spatial Coherence

    Personal voice activity detection (PVAD) has received increased attention due to the growing popularity of personal mobile devices and smart speakers. PVAD is often an integral element of speech enhancement and recognition for these applications, in which lightweight signal processing is enabled only for the target user. However, in real-world scenarios, the detection performance may degrade because of competing speakers, background noise, and reverberation. To address this problem, we propose to use equivalent rectangular bandwidth (ERB)-scaled spatial coherence as the input feature to train an array configuration-agnostic PVAD (ARCA-PVAD) network. Although the network model requires only 112k parameters, it exhibits excellent detection performance and robustness in adverse acoustic conditions. Notably, the proposed ARCA-PVAD system is scalable to array configurations. Experimental results have demonstrated the superior performance achieved by the proposed ARCA-PVAD system over a baseline in terms of the area under receiver operating characteristic curve and equal error rate. Comment: Accepted by INTER-NOISE 2023. arXiv admin note: text overlap with arXiv:2211.0874

    Deep Beamforming for Speech Enhancement and Speaker Localization with an Array Response-Aware Loss Function

    Recent research advances in deep neural network (DNN)-based beamformers have shown great promise for speech enhancement under adverse acoustic conditions. Different network architectures and input features have been explored in estimating beamforming weights. In this paper, we propose a deep beamformer based on an efficient convolutional recurrent network (CRN) trained with a novel ARray RespOnse-aWare (ARROW) loss function. The ARROW loss exploits the array responses of the target and interferer by using the ground truth relative transfer functions (RTFs). The DNN-based beamforming system, trained with ARROW loss through supervised learning, is able to perform speech enhancement and speaker localization jointly. Experimental results have shown that the proposed deep beamformer, trained with the linearly weighted scale-invariant source-to-noise ratio (SI-SNR) and ARROW loss functions, achieves superior performance in speech enhancement and speaker localization compared to two baselines. Comment: 6 page

    Noise Diagnostics of Scooter Faults by Using MPEG-7 Audio Features and Intelligent Classification Techniques

    A scooter fault diagnostic system that makes use of feature extraction and intelligent classification algorithms is presented in this paper. Sound features based on the MPEG (Moving Picture Experts Group)-7 coding standard and several other features in the time and frequency domains are extracted from noise data and preprocessed prior to classification. Classification algorithms including the Nearest Neighbor Rule (NNR), Artificial Neural Networks (ANN), Fuzzy Neural Networks (FNN), and Hidden Markov Models (HMM) are employed to identify and classify scooter noise. A training phase is required to establish a feature space template, followed by a test phase in which the audio features of the test data are calculated and matched to the feature space template. The proposed techniques were applied to classify noise data due to various kinds of scooter fault, such as belt damage, pulley damage, etc. The results reveal that the performance of the methods is satisfactory, varying slightly with the algorithm and the type of noise used in the tests.

    Analysis and DSP Implementation of a Broadband Duct ANC System Using Spatially Feedforward Structure

    The active control technique for broadband attenuation of noise in ducts, using a spatially feedforward structure, is investigated from the viewpoints of both acoustic analysis and control engineering. According to the previous work by Munjal and Eriksson [1], there exists an ideal controller for this problem. The ideal controller is a function of the finite source impedance and is thus independent of the boundary conditions. Despite its simplicity, the ideal controller cannot be practically implemented due to the difficulty of calibrating the electro-mechanical parameters. To overcome this problem, the controller is implemented via an equivalent formulation modified from the controller originally proposed by Roure [2]. The modified controller is implemented on a DSP platform, using an FIR filter, an IIR filter and a hybrid filter. The experimental results showed that the system achieved a maximum attenuation of 17.2 dB in the 300 to 600 Hz frequency band. Physical insights and design considerations in the implementation phase are also discussed in the paper.

    Learning-based robust speaker counting and separation with the aid of spatial coherence

    A three-stage approach is proposed for speaker counting and speech separation in noisy and reverberant environments. In the spatial feature extraction, a spatial coherence matrix (SCM) is computed using whitened relative transfer functions (wRTFs) across time frames. The global activity functions of each speaker are estimated from a simplex constructed using the eigenvectors of the SCM, while the local coherence functions are computed from the coherence between the wRTFs of a time-frequency bin and the global activity function-weighted RTF of the target speaker. In speaker counting, we use the eigenvalues of the SCM and the maximum similarity of the interframe global activity distributions between two speakers as the input features to the speaker counting network (SCnet). In speaker separation, a global and local activity-driven network (GLADnet) is used to extract each independent speaker signal, which is particularly useful for highly overlapping speech signals. Experimental results obtained from the real meeting recordings show that the proposed system achieves superior speaker counting and speaker separation performance compared to previous publications without the prior knowledge of the array configurations.